For my end-of-studies mini senior project, I found a dataset of Airbnb rental data covering more than 51,000 listings in European cities. The dataset includes features such as the total listing price, room type, host status, amenities and location information, which can be used to analyze how these factors relate to Airbnb prices. For more information see: https://www.kaggle.com/datasets/thedevastator/airbnb-price-determinants-in-europe?resource=download
We have data for several European cities, for both weekdays and weekends. So let’s begin by importing all the files into one aggregate data set:
# data frame for storing dataset
combined_data <- data.frame()
# list of cities & data types
cities <- c("amsterdam", "athens", "barcelona", "berlin", "budapest", "lisbon", "london", "paris", "rome", "vienna")
data_types <- c("weekdays", "weekends")
# import data from each file into combined_data
for (city in cities) {
  for (data_type in data_types) {
    # file_path for CSV file
    file_path <- paste("data/", city, "_", data_type, ".csv", sep = "")
    # import CSV file
    city_data <- read.csv(file_path)
    # Add variables to identify city and data_type (weekends or weekdays)
    city_data$city <- city
    city_data$data_type <- data_type
    # Append to combined data
    combined_data <- rbind(combined_data, city_data)
  }
}
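Growing a data frame with rbind() inside a loop copies it on every iteration. A minimal sketch of an alternative, which reads each file into a list and binds once at the end, could look like the following. To stay self-contained it writes two tiny stand-in CSV files to a temporary directory; the `demo_dir` path, the `read_one()` helper and the toy values are assumptions made for illustration and are not part of the dataset.

```r
# Stand-in files so the sketch runs anywhere (hypothetical toy data)
demo_dir <- file.path(tempdir(), "airbnb_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(realSum = c(194.0, 344.2)),
          file.path(demo_dir, "amsterdam_weekdays.csv"), row.names = FALSE)
write.csv(data.frame(realSum = c(210.5, 399.9, 150.0)),
          file.path(demo_dir, "amsterdam_weekends.csv"), row.names = FALSE)

# Every city / data_type combination, one row per file
files <- expand.grid(city = "amsterdam",
                     data_type = c("weekdays", "weekends"),
                     stringsAsFactors = FALSE)

# Hypothetical helper: read one file and tag it with its origin
read_one <- function(city, data_type) {
  df <- read.csv(file.path(demo_dir, paste0(city, "_", data_type, ".csv")))
  df$city <- city
  df$data_type <- data_type
  df
}

# Read all files into a list, then bind once at the end
pieces <- mapply(read_one, files$city, files$data_type,
                 SIMPLIFY = FALSE, USE.NAMES = FALSE)
demo_combined <- do.call(rbind, pieces)
demo_combined
```

With the real files, `files` would be built from the full `cities` and `data_types` vectors and `demo_dir` would be replaced by the `data/` folder.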
Now let’s check the top rows of the data to get an idea of what we’re working with:
head(combined_data, 15)
## X realSum room_type room_shared room_private person_capacity
## 1 0 194.0337 Private room False True 2
## 2 1 344.2458 Private room False True 4
## 3 2 264.1014 Private room False True 2
## 4 3 433.5294 Private room False True 4
## 5 4 485.5529 Private room False True 2
## 6 5 552.8086 Private room False True 3
## 7 6 215.1243 Private room False True 2
## 8 7 2771.3074 Entire home/apt False False 4
## 9 8 1001.8044 Entire home/apt False False 4
## 10 9 276.5215 Private room False True 2
## 11 10 909.4744 Entire home/apt False False 2
## 12 11 319.6401 Private room False True 2
## 13 12 675.6028 Entire home/apt False False 4
## 14 13 552.8086 Entire home/apt False False 2
## 15 14 209.0315 Private room False True 2
## host_is_superhost multi biz cleanliness_rating guest_satisfaction_overall
## 1 False 1 0 10 93
## 2 False 0 0 8 85
## 3 False 0 1 9 87
## 4 False 0 1 9 90
## 5 True 0 0 10 98
## 6 False 0 0 8 100
## 7 False 0 0 10 94
## 8 True 0 0 10 100
## 9 False 0 0 9 96
## 10 False 1 0 10 88
## 11 False 0 0 10 96
## 12 True 1 0 10 97
## 13 False 0 0 8 87
## 14 True 0 0 10 100
## 15 False 1 0 8 96
## bedrooms dist metro_dist attr_index attr_index_norm rest_index
## 1 1 5.0229638 2.5393800 78.69038 4.166708 98.25390
## 2 1 0.4883893 0.2394039 631.17638 33.421209 837.28076
## 3 1 5.7483119 3.6516213 75.27588 3.985908 95.38695
## 4 2 0.3848620 0.4398761 493.27253 26.119108 875.03310
## 5 1 0.5447382 0.3186926 552.83032 29.272733 815.30574
## 6 2 2.1314201 1.9046682 174.78896 9.255191 225.20166
## 7 1 1.8810916 0.7297467 200.16765 10.599010 242.76552
## 8 3 1.6868070 1.4584036 208.80811 11.056528 272.31382
## 9 2 3.7191414 1.1961124 106.22646 5.624761 133.87620
## 10 1 3.1423614 0.9244044 206.25286 10.921226 238.29126
## 11 1 1.0099220 0.9171151 409.85812 21.702260 555.11428
## 12 1 2.1827071 1.5903814 191.50134 10.140123 229.29740
## 13 1 2.9330458 0.6280730 214.92334 11.380334 269.62490
## 14 1 1.3054939 1.3421624 325.25595 17.222519 390.91205
## 15 1 7.3045353 3.7208139 59.77618 3.165188 75.70106
## rest_index_norm lng lat city data_type
## 1 6.846473 4.90569 52.41772 amsterdam weekdays
## 2 58.342928 4.90005 52.37432 amsterdam weekdays
## 3 6.646700 4.97512 52.36103 amsterdam weekdays
## 4 60.973565 4.89417 52.37663 amsterdam weekdays
## 5 56.811677 4.90051 52.37508 amsterdam weekdays
## 6 15.692376 4.87699 52.38966 amsterdam weekdays
## 7 16.916251 4.91570 52.38296 amsterdam weekdays
## 8 18.975219 4.88467 52.38749 amsterdam weekdays
## 9 9.328686 4.86459 52.40175 amsterdam weekdays
## 10 16.604478 4.87600 52.34700 amsterdam weekdays
## 11 38.681161 4.87956 52.36953 amsterdam weekdays
## 12 15.977773 4.92496 52.37107 amsterdam weekdays
## 13 18.787851 4.88934 52.34697 amsterdam weekdays
## 14 27.239314 4.87417 52.37509 amsterdam weekdays
## 15 5.274959 4.99679 52.35645 amsterdam weekdays
summary(combined_data)
## X realSum room_type room_shared
## Min. : 0 Min. : 34.78 Length:51707 Length:51707
## 1st Qu.: 646 1st Qu.: 148.75 Class :character Class :character
## Median :1334 Median : 211.34 Mode :character Mode :character
## Mean :1621 Mean : 279.88
## 3rd Qu.:2382 3rd Qu.: 319.69
## Max. :5378 Max. :18545.45
## room_private person_capacity host_is_superhost multi
## Length:51707 Min. :2.000 Length:51707 Min. :0.0000
## Class :character 1st Qu.:2.000 Class :character 1st Qu.:0.0000
## Mode :character Median :3.000 Mode :character Median :0.0000
## Mean :3.162 Mean :0.2914
## 3rd Qu.:4.000 3rd Qu.:1.0000
## Max. :6.000 Max. :1.0000
## biz cleanliness_rating guest_satisfaction_overall
## Min. :0.0000 Min. : 2.000 Min. : 20.00
## 1st Qu.:0.0000 1st Qu.: 9.000 1st Qu.: 90.00
## Median :0.0000 Median :10.000 Median : 95.00
## Mean :0.3502 Mean : 9.391 Mean : 92.63
## 3rd Qu.:1.0000 3rd Qu.:10.000 3rd Qu.: 99.00
## Max. :1.0000 Max. :10.000 Max. :100.00
## bedrooms dist metro_dist attr_index
## Min. : 0.000 Min. : 0.01504 Min. : 0.002301 Min. : 15.15
## 1st Qu.: 1.000 1st Qu.: 1.45314 1st Qu.: 0.248480 1st Qu.: 136.80
## Median : 1.000 Median : 2.61354 Median : 0.413269 Median : 234.33
## Mean : 1.159 Mean : 3.19129 Mean : 0.681540 Mean : 294.20
## 3rd Qu.: 1.000 3rd Qu.: 4.26308 3rd Qu.: 0.737840 3rd Qu.: 385.76
## Max. :10.000 Max. :25.28456 Max. :14.273577 Max. :4513.56
## attr_index_norm rest_index rest_index_norm lng
## Min. : 0.9263 Min. : 19.58 Min. : 0.5928 Min. :-9.2263
## 1st Qu.: 6.3809 1st Qu.: 250.85 1st Qu.: 8.7515 1st Qu.:-0.0725
## Median : 11.4683 Median : 522.05 Median : 17.5422 Median : 4.8730
## Mean : 13.4238 Mean : 626.86 Mean : 22.7862 Mean : 7.4261
## 3rd Qu.: 17.4151 3rd Qu.: 832.63 3rd Qu.: 32.9646 3rd Qu.:13.5188
## Max. :100.0000 Max. :6696.16 Max. :100.0000 Max. :23.7860
## lat city data_type
## Min. :37.95 Length:51707 Length:51707
## 1st Qu.:41.40 Class :character Class :character
## Median :47.51 Mode :character Mode :character
## Mean :45.67
## 3rd Qu.:51.47
## Max. :52.64
Now let’s include the libraries we need to move forward:
library(ggplot2)
library(tidyr)
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
library(caret)
## Loading required package: lattice
library(leaflet)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:randomForest':
##
## combine
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(sf)
## Linking to GEOS 3.11.2, GDAL 3.6.2, PROJ 9.2.0; sf_use_s2() is TRUE
library(readr)
library(corrplot)
## corrplot 0.92 loaded
library(RColorBrewer)
library(ggplotify)
library(grid)
Let’s clean up the data. To limit the influence of outliers, I remove the rows in the lowest and highest 10% of listing prices. I also convert the boolean text variables to 0/1 integers and create integer dummy variables for each city and for whether the listing price applies to weekends or weekdays.
# Convert text variables to boolean integers
combined_data$room_shared <- ifelse(combined_data$room_shared == "False", 0, 1)
combined_data$room_private <- ifelse(combined_data$room_private == "False", 0, 1)
combined_data$host_is_superhost <- ifelse(combined_data$host_is_superhost == "False", 0, 1)
# Create dummy variables to represent data_type
combined_data$for_weekends <- as.integer(combined_data$data_type == "weekends")
combined_data$for_weekdays <- as.integer(combined_data$data_type == "weekdays")
# Create dummy variable for listings that are not private rooms
# (entire homes/apartments; note that shared rooms are also coded as 1 here)
combined_data$full_home <- as.integer(combined_data$room_type != "Private room")
# Add a dummy variable for each city
encoded_cities <- model.matrix(~ 0 + city, data = combined_data)
colnames(encoded_cities) <- sub("city", "", colnames(encoded_cities))
combined_data <- cbind(combined_data, encoded_cities)
# Remove 10% of outliers in terms of listing price both from the bottom and the top
percentile_10 <- quantile(combined_data$realSum, 0.1)
percentile_90 <- quantile(combined_data$realSum, 0.9)
filtered_data <- combined_data %>%
  filter(realSum >= percentile_10, realSum <= percentile_90)
original_data <- combined_data
combined_data <- filtered_data
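To make the effect of this percentile filter concrete, here is a small sketch on a toy vector (the `toy_price` values are hypothetical). Cutting at the 10th and 90th percentiles keeps about 80% of the rows, which matches the drop from 51,707 to 41,369 listings in the real data. Because the trimming is applied to the target variable itself, the error metrics later in this analysis describe the trimmed price range only.

```r
# Hypothetical toy prices: 1, 2, ..., 100
toy_price <- 1:100

# 10th and 90th percentiles (R's default type-7 quantiles)
lower <- quantile(toy_price, 0.1)
upper <- quantile(toy_price, 0.9)

# Keep only the prices inside [lower, upper]
kept <- toy_price[toy_price >= lower & toy_price <= upper]

length(kept) / length(toy_price)  # share of rows retained: 0.8
range(kept)                       # 11 90
```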
Now let’s summarize our transformed data:
summary(combined_data)
## X realSum room_type room_shared
## Min. : 0 Min. :113.2 Length:41369 Min. :0.000000
## 1st Qu.: 654 1st Qu.:160.4 Class :character 1st Qu.:0.000000
## Median :1365 Median :211.3 Mode :character Median :0.000000
## Mean :1638 Mean :234.8 Mean :0.005125
## 3rd Qu.:2415 3rd Qu.:289.4 3rd Qu.:0.000000
## Max. :5378 Max. :500.8 Max. :1.000000
## room_private person_capacity host_is_superhost multi
## Min. :0.0000 Min. :2.000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:2.000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :3.000 Median :0.0000 Median :0.0000
## Mean :0.3684 Mean :3.103 Mean :0.2636 Mean :0.2926
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :6.000 Max. :1.0000 Max. :1.0000
## biz cleanliness_rating guest_satisfaction_overall bedrooms
## Min. :0.0000 Min. : 2.000 Min. : 20.00 Min. :0.000
## 1st Qu.:0.0000 1st Qu.: 9.000 1st Qu.: 90.00 1st Qu.:1.000
## Median :0.0000 Median :10.000 Median : 95.00 Median :1.000
## Mean :0.3476 Mean : 9.407 Mean : 92.73 Mean :1.112
## 3rd Qu.:1.0000 3rd Qu.:10.000 3rd Qu.: 98.00 3rd Qu.:1.000
## Max. :1.0000 Max. :10.000 Max. :100.00 Max. :9.000
## dist metro_dist attr_index attr_index_norm
## Min. : 0.03466 Min. : 0.002301 Min. : 15.15 Min. : 0.9263
## 1st Qu.: 1.42561 1st Qu.: 0.249888 1st Qu.: 143.33 1st Qu.: 6.7956
## Median : 2.63400 Median : 0.414120 Median : 237.73 Median : 11.5968
## Mean : 3.17955 Mean : 0.676397 Mean : 296.63 Mean : 13.1757
## 3rd Qu.: 4.28928 3rd Qu.: 0.734352 3rd Qu.: 382.28 3rd Qu.: 16.9250
## Max. :25.28456 Max. :14.273577 Max. :4513.56 Max. :100.0000
## rest_index rest_index_norm lng lat
## Min. : 19.58 Min. : 0.6407 Min. :-9.22634 Min. :37.95
## 1st Qu.: 271.43 1st Qu.: 8.9507 1st Qu.:-0.07504 1st Qu.:41.41
## Median : 534.31 Median : 18.3118 Median : 4.86326 Median :47.50
## Mean : 641.54 Mean : 23.1876 Mean : 7.17219 Mean :45.62
## 3rd Qu.: 836.29 3rd Qu.: 33.5754 3rd Qu.:13.44577 3rd Qu.:51.45
## Max. :6696.16 Max. :100.0000 Max. :23.78602 Max. :52.64
## city data_type for_weekends for_weekdays
## Length:41369 Length:41369 Min. :0.0000 Min. :0.0000
## Class :character Class :character 1st Qu.:0.0000 1st Qu.:0.0000
## Mode :character Mode :character Median :1.0000 Median :0.0000
## Mean :0.5068 Mean :0.4932
## 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000
## full_home amsterdam athens barcelona
## Min. :0.0000 Min. :0.00000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :1.0000 Median :0.00000 Median :0.00000 Median :0.00000
## Mean :0.6316 Mean :0.02826 Mean :0.07994 Mean :0.05777
## 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.00000 Max. :1.00000 Max. :1.00000
## berlin budapest lisbon london
## Min. :0.00000 Min. :0.00000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.00000 Median :0.00000 Median :0.0000 Median :0.0000
## Mean :0.05248 Mean :0.07982 Mean :0.1228 Mean :0.1849
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :1.00000 Max. :1.00000 Max. :1.0000 Max. :1.0000
## paris rome vienna
## Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.0000 Median :0.0000 Median :0.00000
## Mean :0.1274 Mean :0.1882 Mean :0.07842
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.0000 Max. :1.00000
Now let’s proceed with some basic exploratory data analysis to get an idea of how the listing price varies:
# Price histogram
ggplot(combined_data, aes(x = realSum)) +
  geom_histogram(fill = "steelblue", bins = 30) +
  labs(x = "Price", y = "Frequency", title = "Distribution of Prices") +
  theme_minimal()
# Price by room type
ggplot(combined_data, aes(x = room_type, y = realSum)) +
  geom_boxplot(fill = "steelblue") +
  labs(x = "Room Type", y = "Price", title = "Price Variation by Room Type") +
  theme_minimal()
# Price vs. distance to metro
ggplot(combined_data, aes(x = metro_dist, y = realSum)) +
  geom_point(color = "steelblue") +
  labs(x = "Distance to Metro", y = "Price", title = "Price vs. Distance to Metro") +
  theme_minimal()
# Price by city
ggplot(combined_data, aes(x = city, y = realSum)) +
  geom_boxplot(fill = "steelblue") +
  labs(x = "City", y = "Price", title = "Distribution of Prices by City") +
  theme_minimal()
# Price by data_type (weekends and weekdays)
ggplot(combined_data, aes(x = data_type, y = realSum)) +
  geom_boxplot(fill = "steelblue") +
  labs(x = "Day Type", y = "Price", title = "Distribution of Prices: Weekdays vs. Weekends") +
  theme_minimal()
We can see that the listing price isn’t normally distributed: even after trimming, the mean (234.8) sits above the median (211.3), which points to a right skew. Entire homes and apartments are priced higher than private rooms, which in turn are priced higher than shared rooms. Distance to the nearest metro station is somewhat negatively correlated with the listing price, and there is significant variation between cities. Listing prices for weekdays and weekends, however, aren’t very different.
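One way to back up the visual impression of non-normality with numbers is to compare the mean with the median and to compute the sample skewness. The sketch below does this on simulated right-skewed prices; `toy_price` and the `skewness()` helper are illustrative assumptions, and with the real data one would pass `combined_data$realSum` instead.

```r
set.seed(1)
# Simulated right-skewed "prices" (log-normal), standing in for realSum
toy_price <- rlnorm(5000, meanlog = 5.3, sdlog = 0.5)

# Hypothetical helper: moment-based sample skewness
# (close to 0 for a symmetric distribution, positive for a long right tail)
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

c(mean = mean(toy_price), median = median(toy_price))  # mean above median
skewness(toy_price)       # clearly positive: long right tail
skewness(log(toy_price))  # close to 0 after a log transform
```

If the real prices behave similarly, modelling the logarithm of the price is one option worth trying, although that is not what the analysis here does.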
Let’s take a look at the number of listings per city in our sample:
# Define the cities
cities <- c("amsterdam", "athens", "barcelona", "berlin", "budapest", "lisbon", "london", "paris", "rome", "vienna")
# Generate a color palette
n_colors <- length(cities)
color_palette <- brewer.pal(n_colors, "Set3")
# Create a named vector of colors
city_colors <- setNames(color_palette, cities)
# Number of listings per city, with the counts printed on the slices
whole_dataset_pie <- original_data %>%
  group_by(city) %>%
  summarize(count = n()) %>%
  ggplot(aes(x = "", y = count, fill = city)) +
  geom_bar(stat = "identity", width = 1) +
  geom_text(aes(label = count), position = position_stack(vjust = 0.5), size = 2.5, color = "black") + # Add count labels to the slices
  coord_polar(theta = "y") +
  scale_fill_manual(values = city_colors) +
  labs(title = "Number of Listings per City (Whole Dataset)") +
  theme_void() # Use theme_void to create a clear background
# Print the pie chart
print(whole_dataset_pie)
Let’s try looking at these on a map:
# Create an sf object
combined_sf <- st_as_sf(original_data, coords = c("lng", "lat"), crs = 4326)
# Create a base leaflet map centered on Europe
m <- leaflet() %>%
  addTiles(urlTemplate = "https://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png") %>% # OpenStreetMap tiles
  setView(lng = 10, lat = 50, zoom = 6)
# Add markers with different colors based on room_type
m <- m %>%
  addCircleMarkers(
    data = combined_sf,
    fillColor = ~case_when(
      room_type == "Entire home/apt" ~ "blue",
      room_type == "Private room" ~ "orange",
      room_type == "Shared room" ~ "red",
      TRUE ~ "blue" # Use a default color for other cases
    ),
    fillOpacity = 0.7, # Adjust opacity
    radius = 5, # Adjust marker size
    group = "Airbnb Listings", # Group for layer control
    popup = ~paste("City: ", city, "<br>Room Type: ", room_type) # Popup content
  )
# Add layer control for toggling layers on/off
m <- m %>%
  addLayersControl(overlayGroups = "Airbnb Listings", position = "topleft")
# Recompute the map size shortly after rendering so the tiles fill the widget
m <- m %>%
  htmlwidgets::onRender("
    function(el, x) {
      var map = this; // inside onRender, `this` is the Leaflet map object
      setTimeout(function() {
        map.invalidateSize();
      }, 100);
    }
  ")
# Display the map
m
Let’s proceed with a correlation matrix to get an idea of which explanatory variables correlate with the listing price, and of the correlations among the explanatory variables themselves.
# Correlation matrix
cor_vars <- c(
  "realSum", "person_capacity", "cleanliness_rating", "guest_satisfaction_overall",
  "bedrooms", "dist", "metro_dist", "rest_index", "attr_index",
  "rest_index_norm", "attr_index_norm", "lng", "lat", "biz",
  "host_is_superhost", "room_shared", "room_private",
  "for_weekends", "for_weekdays", "full_home", "multi"
)
cor_matrix <- cor(combined_data[, cor_vars])
# Plot correlation matrix
corrplot(cor_matrix, method = "color", type = "upper", order = "hclust", tl.cex = 0.7)
# Center and improve appearance
par(mar = c(1, 1, 1, 1))
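Strongly correlated pairs can also be listed directly from the matrix instead of being read off the plot. The following is a minimal sketch on toy data; the `toy` data frame and the 0.8 `threshold` are assumptions for illustration. With the real data, `cm` would simply be `cor_matrix`.

```r
set.seed(42)
n <- 200
# Hypothetical toy predictors: b is nearly a copy of a, c is independent
a <- rnorm(n)
toy <- data.frame(a = a,
                  b = a + rnorm(n, sd = 0.1),
                  c = rnorm(n))

cm <- cor(toy)

# Look only at the upper triangle so each pair is reported once
threshold <- 0.8  # assumed cut-off, adjust to taste
idx <- which(abs(cm) > threshold & upper.tri(cm), arr.ind = TRUE)

high_pairs <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = cm[idx]
)
high_pairs  # one row: the pair (a, b)
```

Each row of `high_pairs` is a candidate for dropping one of the two variables.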
I chose a random forest algorithm to look for non-linear relationships between the price and the explanatory variables, since a conventional linear regression did not work on our aggregate data. I first ran the RF algorithm with all the numerical explanatory variables, then used feature importance analysis to remove variables that contributed little. Finally, wherever the correlation matrix showed collinear variables, I removed the one with the lower feature importance score.
# Independent variables
predictors <- c(
"lat",
"lng",
"attr_index_norm",
"dist",
"bedrooms",
"guest_satisfaction_overall",
"barcelona", "london", "host_is_superhost", "multi", "biz", "amsterdam"
)
data_subset <- combined_data[, c(predictors, "realSum")]
# Split the data into a training set and a testing set
set.seed(123) # For reproducibility
sample_index <- sample(1:nrow(data_subset), 0.7 * nrow(data_subset))
train_data <- data_subset[sample_index, ]
test_data <- data_subset[-sample_index, ]
# Train the Random Forest model
rf_model <- randomForest(realSum ~ ., data = train_data, ntree = 500)
# Make predictions on the test set
predictions <- predict(rf_model, test_data)
# Evaluate the model
rmse <- sqrt(mean((test_data$realSum - predictions)^2))
mae <- mean(abs(test_data$realSum - predictions))
# Print the evaluation metrics
cat("Root Mean Squared Error (RMSE):", rmse, "\n")
## Root Mean Squared Error (RMSE): 55.90214
cat("Mean Absolute Error (MAE):", mae, "\n")
## Mean Absolute Error (MAE): 40.32739
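The RMSE (about 55.9) and MAE (about 40.3) are acceptable in comparison to the mean (234.8) and median (211.3) listing price of the trimmed data: the typical error is roughly 17% to 24% of the mean price. So the model seems to be adequate.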
# Calculate R-squared
r_squared <- 1 - (sum((test_data$realSum - predictions)^2) / sum((test_data$realSum - mean(test_data$realSum))^2))
print(paste("R-squared (R²):", r_squared))
## [1] "R-squared (R²): 0.637456402396933"
# Calculate MAPE
mape <- mean(abs((test_data$realSum - predictions) / test_data$realSum)) * 100
print(paste("Mean Absolute Percentage Error (MAPE):", mape, "%"))
## [1] "Mean Absolute Percentage Error (MAPE): 18.1648715125016 %"
Our R² of about 0.64 and MAPE of about 18% are acceptable, though not ideal: the model explains a little under two thirds of the price variance in the test set.
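The four metrics above can be wrapped in one small function, which makes it easier to compare models later. This is only a sketch: `regression_metrics()` is a hypothetical helper, and the toy vectors are chosen so the arithmetic can be checked by hand.

```r
# Hypothetical helper collecting the four metrics used in this analysis
regression_metrics <- function(observed, predicted) {
  err <- observed - predicted
  c(
    rmse = sqrt(mean(err^2)),
    mae  = mean(abs(err)),
    r2   = 1 - sum(err^2) / sum((observed - mean(observed))^2),
    mape = mean(abs(err / observed)) * 100
  )
}

# Toy values: errors are -10, 10 and -30
observed  <- c(100, 200, 300)
predicted <- c(110, 190, 330)
m <- regression_metrics(observed, predicted)
round(m, 3)  # rmse 19.149, mae 16.667, r2 0.945, mape 8.333
```

On the real data the call would be `regression_metrics(test_data$realSum, predictions)`.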
# Compare observed values and predicted values
plot(test_data$realSum, predictions,
xlab = "Observed Price",
ylab = "Predicted Price",
main = "Comparison of Observed and Predicted Prices",
col = "blue",
pch = 16)
# Add a diagonal reference line
abline(0, 1, col = "red")
Let’s take a look at the residual plot to see whether the error terms show any pattern:
# Calculate residuals
residuals <- test_data$realSum - predictions
# Plot residuals against predicted values
plot(predictions, residuals,
xlab = "Predicted Price",
ylab = "Residuals",
main = "Residual Plot",
col = "blue",
pch = 16)
# Add a horizontal reference line at y = 0
abline(h = 0, col = "red")
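With thousands of points, a trend in the residuals can be hard to spot by eye. One complementary check is to group the predictions into deciles and look at the mean and spread of the residuals in each group. The sketch below uses simulated stand-ins (`toy_pred`, `toy_resid`), which are assumptions; on the real data one would use `predictions` and `residuals`.

```r
set.seed(7)
# Hypothetical stand-ins for the model's predictions and residuals
toy_pred  <- runif(1000, min = 120, max = 480)
toy_resid <- rnorm(1000, mean = 0, sd = 50)

# Split the predictions into 10 roughly equally sized bins (deciles)
breaks <- quantile(toy_pred, probs = seq(0, 1, by = 0.1))
bin <- cut(toy_pred, breaks = breaks, include.lowest = TRUE, labels = FALSE)

# Mean residual and residual spread per bin
by_bin <- data.frame(
  bin      = 1:10,
  mean_res = as.numeric(tapply(toy_resid, bin, mean)),
  sd_res   = as.numeric(tapply(toy_resid, bin, sd))
)
by_bin
```

Mean residuals that drift away from zero across bins would suggest systematic over- or under-prediction, and a spread that grows with the predicted price would suggest heteroscedasticity.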
Let’s look into feature importance to see if we have any redundant explanatory variables:
# Extract the importance scores (named so as not to mask the importance() function)
importance_scores <- importance(rf_model)
# Create the feature importance plot
varImpPlot(rf_model, pch = 19, col = "blue", bg = "white", main = "Feature Importance")
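Since the model above was fitted without `importance = TRUE`, the plot reports node-impurity importance (IncNodePurity) only. The sketch below shows, on toy data, how the permutation-based measure (%IncMSE) can be obtained as a second opinion. The `toy` data frame and its coefficients are assumptions for illustration, not results from the Airbnb data.

```r
library(randomForest)

set.seed(123)
n <- 300
# Hypothetical toy data: y depends strongly on x1, weakly on x2, not on noise
toy <- data.frame(x1 = runif(n), x2 = runif(n), noise = runif(n))
toy$y <- 10 * toy$x1 + 2 * toy$x2 + rnorm(n, sd = 0.5)

# importance = TRUE is needed for the permutation-based measure (%IncMSE)
toy_rf <- randomForest(y ~ ., data = toy, ntree = 200, importance = TRUE)

# type = 1: permutation importance (%IncMSE); type = 2: node impurity
imp <- importance(toy_rf, type = 1)
imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE]
```

Variables that rank low on both measures are the safest candidates for removal.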
In conclusion, listing prices in our sample vary considerably between cities. Our model reflects this through the strong relationship between price and longitude and latitude, which incidentally turned out to be more strongly related to price than the city dummy variables. The price is also correlated with the indexes relating to distance to restaurants and attractions, but since these are correlated with each other we used only one of them in the random forest model to prevent collinearity. The same reasoning applies to the number of bedrooms and the distance to the city center.
Let’s perform K-fold cross-validation:
# Create a data frame with only the selected predictors and the target variable (price)
data_subset <- combined_data[, c(predictors, "realSum")]
# Define the number of folds (K) for cross-validation
num_folds <- 5
# Create a training control object for cross-validation
train_control <- trainControl(
  method = "cv", # Use K-fold cross-validation
  number = num_folds, # Number of folds
  verboseIter = TRUE # Display progress
)
# Refit the random forest within each fold (this trains new models rather than reusing rf_model)
set.seed(123) # For reproducibility
cv_results <- train(
  realSum ~ ., # Predict realSum from all other columns
  data = data_subset, # Data frame
  method = "rf", # Random forest method
  trControl = train_control, # Training control
  tuneGrid = data.frame(mtry = 3) # Fixed mtry; randomForest's regression default for 12 predictors is 4
)
## + Fold1: mtry=3
## - Fold1: mtry=3
## + Fold2: mtry=3
## - Fold2: mtry=3
## + Fold3: mtry=3
## - Fold3: mtry=3
## + Fold4: mtry=3
## - Fold4: mtry=3
## + Fold5: mtry=3
## - Fold5: mtry=3
## Aggregating results
## Fitting final model on full training set
# Print the cross-validation results, including metrics
print(cv_results)
## Random Forest
##
## 41369 samples
## 12 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 33096, 33096, 33095, 33093, 33096
## Resampling results:
##
## RMSE Rsquared MAE
## 56.81461 0.6472051 42.45537
##
## Tuning parameter 'mtry' was held constant at a value of 3
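As we can see, the cross-validated results (RMSE of about 56.8, R² of about 0.65 and MAE of about 42.5) are close to the hold-out results above (RMSE of about 55.9, R² of about 0.64 and MAE of about 40.3), which reinforces the initial evaluation of the random forest model.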
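To show what the cross-validation does under the hood, here is a minimal base R sketch of 5-fold cross-validation. To keep it fast and dependency-free it uses `lm()` on simulated data as a stand-in for the random forest; the `toy` data and the linear model are assumptions for illustration only.

```r
set.seed(123)
n <- 500
# Hypothetical toy data with a known linear signal and noise sd of 2
toy <- data.frame(x = runif(n, 0, 10))
toy$y <- 3 * toy$x + rnorm(n, sd = 2)

k <- 5
# Assign each row to one of k folds at random, with equal fold sizes
fold <- sample(rep(seq_len(k), length.out = n))

fold_rmse <- sapply(seq_len(k), function(i) {
  train <- toy[fold != i, ]
  test  <- toy[fold == i, ]
  fit   <- lm(y ~ x, data = train)  # lm() stands in for the random forest
  sqrt(mean((test$y - predict(fit, newdata = test))^2))
})

fold_rmse        # one RMSE per held-out fold
mean(fold_rmse)  # the cross-validated RMSE, close to the noise sd of 2
```

Each observation is used for testing exactly once, so the averaged RMSE depends less on one lucky or unlucky split than a single 70/30 hold-out does.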